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Abstract — Inferring the functional specificity of brain regions 
from functional Magnetic Resonance Images (fMRI) data is 
a challenging statistical problem. While the General Linear 
Model (GLM) remains the standard approach for brain mapping, 
supervised learning techniques (a.k.a. decoding) have proven to 
be useful to capture multivariate statistical effects distributed 
across voxels and brain regions. Up to now, much effort has been 
made to improve decoding by incorporating prior knowledge in 
the form of a particular regularization term. In this paper we 
demonstrate that further improvement can be made by account- 
ing for non-linearities using a ranking approach rather than 
the commonly used least-square regression. Through simulation, 
we compare the recovery properties of our approach to linear 
models commonly used in fMRI based decoding. We demonstrate 
the superiority of ranking with a real fMRI dataset. 

Index Terms — fMRI, supervised learning, decoding, ranking 

I. Introduction 

The prediction of behavioral information or cognitive states 
from brain activation images such as those obtained with fMRI 
can be used to assess the specificity of several brain regions for 
certain cognitive or perceptual functions. This kind of analysis 
is implemented by learning a classifier or regression function 
that fits a given target variable given fMRI activations. The 
accuracy of this prediction depends on whether it uses the 
relevant variables i.e. the correct brain regions. Recovering 
the truly predictive pattern has proven to be challenging from 
a statistical point of view: the high dimensionality of the data 
together with the limited number of images makes the problem 
of brain pattern recovery an ill-posed problem. 

So far, the approaches proposed to address this issue have 
relied on linear models, with univariate, i.e. voxel-based, 
Anova (analysis of variance) for hypothesis testing, or, for 
predictive modeling, with the choice of a regularizer using 
a priori domain-specific knowledge, such as the ^i-norm to 
promote sparsity |[T], ||2), total variation to promote spatial 
smoothness ||3). Various data fit terms have been used. Logistic 
Regression (LR) fT\, Linear SVM |4|, Lasso f[\. While 
Linear SVM and LR cannot address the recovery problem for 
multiclass problems, linear regression models assume a linear 
relationship between the quantity to predict and the amplitude 
of the fMRI signals. If the linear relationship does not hold 
in practice, then the estimation of the predictive patterns may 
suffer from a loss of statistical power This can be particularly 



relevant with Blood Oxygen-Level Dependent (BOLD) signals 
observed in fMRI, where a saturation effect is expected as the 
level of signal increases. 

When targets to predict consist of ordered values, as in a 
parametric design, such as clinical scores, pain levels or the 
complexity of a cognitive task, the response to these different 
conditions can reflect the non-linearities in the data. In such 
situation, we propose to use a data fit term, known as loss 
function, not relying on an assumption of linearity but only 
of increasing response. We show on simulations that this new 
formulation opens the door to capturing the non-linearity and 
leads to better recovery of the predictive brain patterns. On 
an fMRI dataset we show that the new formulation leads to 
models with better recovery properties. 

a) Notations: We write vectors in bold, a e M", matrices 
with capital bold letters, A G M"^''. The dot product between 
two vectors is denoted {a,b). We denote by ||a|| = {a, a) 
the £2 norm of a vector 

II. Learning a linear model from fMRI data 

Following standard statistical learning notations we denote 
by Xi e W, 1 < i < n, the data and Hi E y the target 
variables. In this paper, we aim at learning a weight vector 
w S E^* such that the prediction of y can be non-linearly 
related to the value of w^x. The vector w corresponds here to 
a brain map that can be represented in brain space as a volume 
for visualization of the predictive pattern of voxels. It is useful 
to rewrite these quantities in matrix form; more precisely, we 
denote by X e M"^p the design matrix assembled from n 
fMRI volumes and by y e M" the corresponding n targets. In 
other words, each row of X is a p-dimensional sample, i.e., an 
activation map of p voxels related to one stimulus presentation. 

A standard approach to perform the estimation of w leads 
to the following minimization problem 

w = argmin /:(y,X,u;) + A17(w) , A>0 (1) 

w 

where Xil{w) is the regularization term and £(y,X,w) is 
the loss function. The parameter A balances the loss function 
and the penalty r2(w). 

If the explained variable is a linear combination of the 
images, y = Xw + e, we can estimate w using the mean 
squared error loss function £(y,X,w) = \\y — Xw|p. 



However, with fMRI the linearity assumption may not be valid. 
Instead, we model our explained variable as y = F(Xw) + e, 
where F is a non-decreasing function. 

We introduce the use of pairwise loss functions. These loss 
functions only assume the target values to be a non-decreasing 
function of the data. They have been widely used in ranking, 
a type of supervised machine learning problem whose goal 
is to automatically construct an order from the training data. 
A pairwise loss function operates on pairs of images: given 
a pair of images (xi,yi) and (xj,yj), y^ ^ yj we build a 
model that predicts the sign of y, — yj. 

Let I denote the index set of all considered pairs. Note that 
in some settings it might be important to restrict ourselves to 
a selected subgroup of all pairs, e.g. to the pairs of images of 
a single subject or to the pairs of images corresponding to a 
single session. For this purpose we define a,y e M, E I 
to be a weight associated to each pair. We will now present 
the pairwise loss functions used in this article: 

• Pairwise hinge loss |j5). This is a natural extension of the 
loss function used by Support Vector Machines and has 
been successfully used in information retrieval 161. 



T / 
W Xi 



(2) 



where = max{z, 0}. 

Pairwise logistic loss |j7J. This is the pairwise extension 
of the logistic regression loss function. 
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Fig. 1. Correlation between tlie estimated vector w and the ground truth 
w for different loss functions as the number of considered samples increases 
(higher is better) for dimensions 5x5x5 and 7x7x7 respectively. Pairwise 
loss functions outperform linear regression as the number of samples increases 
and tend to a perfect recovery. 



; log(l + exp(w^(xi - Xj))) 



(3) 



When noise is present in the model, the order of two 
samples might get inverted, a phenomenon known as label 
switching. Because this only affects labels that lie close, it is 
natural to penalize more the misclassification of distant labels. 
By setting the sample weights to to a value that increases 
as labels become more separated, we become more robust to 
label switching. In the case of hinge loss functions, several 
strategies for choosing the appropriate weights are discussed 
in ^6J. 

On the implementation side, both pairwise hinge loss and 
pairwise logistic loss can be implemented on top of existing 
SVM and Logistic Regression solvers, respectively, by taking 
the difference of pairs as input values. In practice, we used 
the liblinear ||8) library via the scikit-learn |j9) library. 

III. Simulation 

b) Data generation: The simulated data X contains 
volumes of size 5x5x5 and 7x7x7, each one consisting 
of Gaussian white noise smoothed by a Gaussian kernel 
with standard deviation of 2 voxels. This mimics the spatial 
correlation structure observed in real fMRI data. The simulated 
vector of coefficients w has a support restricted to four cubic 
Re gions of Interest (ROIs) of size (2 x 2 x 2). The values of 
w restricted to these ROIs are {1,1,-1,-1}. 
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Fig. 2. Correlation between the estimated vector w and the ground truth 
w for different loss functions with a noise level of 5%. Appropriately setting 
the weights plays a major role in robustness. Without the correct weighting 
and under noisy conditions, pairwise logistic loss function fails to recover the 
correct model. 



The target variable y/ e M" is simulated as a linear model: 

y, = Xw + e (4) 

where the noise e I^fif] follows a uniform distribu- 
tion. (7 is chosen such that the signal-to-noise ratio verifies 
||e||/||Xi(;|| — 5%. We then define another target variable y„; 
to be a sigmoid function of y;, that is. 



1 



1 + exp (-y/) 



(5) 



For each of the loss functions introduced earlier and 
mean squared error, we compute the correlation coefficient 
p{w,w) = {w,w) / {\\w\\\\w\\). This gives us the goodness of 
fit for the estimated w. A method with correlation coefficient 
of 1 is able to recover perfectly the ground truth. Since we 
are interested in the estimation of w, we restricted ourselves 
to linear models, leaving out models such as kernel ridge or 
support vector regression with Gaussian kernel. 

For all models we added £2 regularization in the form of 
A||ii;|p and we cross-validated A separately for each loss. 
In this setting, the model with mean squared error loss is 
equivalent to ridge regression. In order to focus on the effect of 
non-linearity, we first considered a noiseless simulation using 
trivial weights Uij — 1. 

Once we learned a vector w for each method we compute 
the correlation coefficient with the ground truth for different 
sizes of the training data. The experiment is repeated 10 
times with different initialization of the pseudorandom number 
generator. We compute errorbars and show the results in 
figure [T] As the number of samples increases, the linear 
model stalls and pairwise loss functions outperform MSE 
on both 5x5x5 and 7x7x7 dimension. As expected, 
the higher dimensionality of the second simulation makes 
the correlation coefficient decrease. However, unlike MSE, 
ranking tends to a perfect recovery as the number of samples 
increases. Both pairwise loss functions perform equivalently 
and have a significatively higher correlation coefficient than 
MSE. In the rest of the paper we will use pairwise logistic 
as loss function. As a result, pairwise loss functions should 
be preferred over MSE in situations where underlying model 
is non-linear. Notice that we fixed the non-linearity to be a 
sigmoid function, but the pairwise loss functions only assume 
that this function is non-decreasing. Unlike linear regression 
models, pairwise loss functions are indeed able to learn the 
structure on the non-linear transform F. 

We now consider the model with noise as in Q and use 
non-trivial weights Uij. To account for label switching, we set 
to zero for pairs with too similar labels: 

Jo if \y,-yj\< a 
I 1 otherwise . 

In the case of discrete values, this would be equivalent to zero- 
ing weights for which the labels are adjacent. We now compute 
the correlation coefficient for weighted and unweighted pair- 
wise logistic models and linear ridge regression model. The 
result can be seen in figure |2] for dimension 5x5x5. The 
unweighted logistic model breaks down in presence of noise 
and performs worse than linear ridge regression. On the other 
hand, appropriately setting the weights aij has a major effect 
on robustness, where this model outperforms MSE in a noisy 
setting. Note also that weighted pairwise logistic has smaller 
variance than MSE. 

IV. Results on fMRI data 

This dataset, described in pO) , consists of 34 healthy volun- 
teers scanned while listening to 16 words sentences with five 




Fig. 3. Scores obtained with tlie pairwise logistic on tlie 4 different ROIs. 
The regions with the best predictive power are the temporal pole the anterior 
superior temporal sulcus. 

different levels of complexity. These were 1 word constituent 
phrases (the simplest), 2 words, 4 words, 8 words and 16 
words respectively, corresponding to 5 levels of complexity 
which was used as class label in our experiments. To clarify, a 
sentence with 16 words using 2 words constituents is formed 
by a series of 8 pairs of words. Words in each pair have a 
common meaning but there is no meaning between each pair 
A sentence has therefore the highest complexity when all the 
16 words form a meaningful sentence. 

The dataset consists of 8 manually labeled ROIs, some in- 
formative and some not. For each ROI separately, we split the 
data into 60% training samples, 20% for parameter selection 
and 20% for validation. We trained a pairwise logistic model 
and set the £2 regularization by cross validation. We choose 
Uij to be zero if classes are adjacent, i.e. if jy^ — Vjl < 1 
and if Xi and Xj do not belong to the same subject, in order 
to consider exclusively non-adjacent pairs of images from 
the same subject. In all other cases, was set to one. We 
computed the generalization score on the validation set as the 
mean number of inversions with respect to the order in labels, 
i.e. the sign flips sgn((Xj — Xj)w) ^ sgn{yi — yj) for all 
pairs of images in the validation set. 

We kept the four ROIs with highest scores (see figure [3]l. 
These are: Anterior Superior Temporal Sulcus (aSTS), Tem- 
poral Pole (TP), Inferior Frontal Gyrus Orbitalis (IFGorb) and 
Inferior Frontal Gyrus triangularis (IFG tri). 

In order to inspect the properties of the estimated functions 
F for each ROI, we estimated w using a pairwise logistic 
model. We then projected our data X along this vector 
w and regularized the result using non parametric locally 
weighted scatterplot smoothing (LOWES S). Results in figure]?] 
show that the linearity varies in shape across ROIs which 
suggests that different regions exhibit different sensitivities to 
the complexity parameter under investigation. We see however 
a trend in the figures towards non-linear and non-decreasing 
functions with some saturation effect of the BOLD signal as 
in the temporal pole (TP). 

In the case of the Temporal Pole (TP), which is the ROI 
revealing the highest saturation effect, an F-test on the data 
{Xw,y) reveals that the quadratic polynomial model fits 
better the data than a linear model (p-value < 0.03). As 
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Fig. 4. Data projected along w showing tlie non-linear effect in the 4 regions 
of interest. This projection gives an insight on the relationship between the 
BOLD signal and the explained variable. We observe that the the shape of 
the non-linearity varies across brain regions. Apart from IFG tri, all regions 
show a saturation effect in the BOLD response. 

shown in the simulations, in this particular case pairwise loss 
functions are likely to improve the recovery of active brain 
regions. 

V. Conclusion 

In this paper, we investigated the use of pairwise loss 
functions to improve the problem of brain pattern recovery 
with supervised learning applied to fMRI data. Through 
simulations, we showed the benefit of such loss functions 
when the target to predict is non-linearly related to the voxel 
amplitudes. Experimental results on fMRI data confirmed the 
presence of such non-linear effects in the data which suggest 
that the pairwise approach should improve the identification 
of predictive brain patterns on experimental data. 

This work shows that improvements in recovery of brain 
activation patterns should not only rely on the choice of a 
particular regularizer, but also on an appropriate loss function. 
Here we have only considered £2-penalized models, but a 
natural extension to work with full brain data would be to con- 
sider pairwise loss functions combined with sparse structured 
penalizations which incorporate domain-specific knowledge. 
This opens the path to further improvements and refinements 
in the recovery of brain pattern activation via supervised 
learning. 
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